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Abstract 

Composite likelihoods are increasingly used in applications where the full likelihood is analytically un- 
known or computationally prohibitive. Although some frequentist properties of the maximum composite 
likelihood estimator are akin to those of the maximum likelihood estimator, Bayesian inference based 
on composite likelihoods is in its early stages. This paper discusses inference when one uses composite 
likelihood in Bayes' formula. We establish that using a composite likelihood results in a proper poste- 
rior density, though it can differ considerably from that stemming from the full likelihood. Building on 
previous work on composite likelihood ratio tests, we use asymptotic theory for misspecified models to 
propose two adjustments to the composite likelihood to obtain appropriate inference. We also investigate 
use of the Metropolis Hastings algorithm and two implementations of the Gibbs sampler for obtaining 
draws from the composite posterior. We test the methods on simulated data and apply them to a spatial 
extreme rainfall dataset. For the simulated data, we find that posterior credible intervals yield appropri- 
ate empirical coverage rates. For the extreme precipitation data, we are able to both effectively model 
marginal behavior throughout the study region and obtain appropriate measures of spatial dependence. 
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1 Introduction 



1.1 Motivation 

The likelihood function is central to both frequentist and Bayesian inference, but in many modern settings 
it may be infeasible to calculate it, either because no analytical form is available, or because such a form 
is known but is computationally prohibitive. The first difficulty arises with max-stable processes, which 



available only for the b ivariate marginal densit ies 



Kabluchko et al 



1 



2009], though 



Genton et al 



Smith 



1990 


Schlather 


2002 


de Haan and Pereira 


2006 



201l| show that substantial efficiency gains are possible if 
trivariate margins can be used. The second difficulty may be experienced when dealing with Gaussian 
random fields on larg e lattices. Both of th ese problems and many other similar ones can be tackled using 



composite likelihoods. 



Padoan et al 



L0J 



20101 and 



Gholamrezaee 



based on marginal events to fit max-stable processes, and 



2010 ] propose the use o f composite likelihood 



Rue and Tielmeland 



2002l | have used composite 



likelihoods based on om i tting components of the full likelihood in approximating Gaussian random fields. 



Rvden and Titterington 



1998f describe the use of pseudo-likelihood, a form of composite likelihood, in 
simulation-based inference involving missing data, and show that their approach leads to a valid Markov 
chain simulation algorithm. 



Frequentist methods for composite likelihoods have been used for some time (for an overview, see 



Varin 



2008]), but little work has been done to explore how composite likelihoods could be employed in a Bayesian 



fram ework. The motivati ng a pplication for this wo rk is the spatial modelling of extremes. Recently authors 



(e.g.. 



Padoan et al 



L0] 



20101 and 



Gholamrezaee 



10] 



2010]) have used composite likelihoods to fit max-stable models, 



enabling the researchers to successfully model dependence between observations. However, the frequentist 
method s they employ may not be flexible enough to acc urately fit the marginal behavior across the study 



region. 



Coolev et al 



2007| and 



Sang and Gelfand 



)9] 



2009J have used Bayesian hierarchical spatial models to 



capture the marginal effects for spatial extremes, but have not used the max-stable process models suggested 
by extreme value theory to describe the dependence in the data. The goal of this work is combine these two 
approaches, and this entails appropriately deploying a composite likelihood within a Bayesian framework. 



1.2 Likelihood asymptotics for composite likelihoods 

Altho ugh it has numerous antecedents, the notion of a composite likelihood was crystallized by 



Lindsay 



1988], who defined it as a combination of valid likelihood entities. Consider a random vector Y £ M. K with 



2 



probability density function f(y; 9) where 9 £ M. p is an unknown parameter vector. Let : i £ I}, I C N, 
be a set of marginal or conditional events for Y and let {wi,i £ 1} be a set of non- negative weights. A 
composite likelihood is defined as 

Lc(0;y) = l[f(ye^i;0) Wi , (l) 

iei 

with corresponding log-composite likelihood 

t c {e;y) = Y J W i \ogf{y£^ V ,e). (2) 
iei 

Below we assume that n independent replicates Y 1 , . . . , Y n of Y are available, yielding a total composite 
likelihood and log likelihood of the form 

n n 

Ll ot (0; y) = H H f(y* £ ^; , ^(9; y) = £ £ Wi log ftf £ sf« 9), 

3=1 iei 3 = 1 iei 

and consider asymptotics as n — ¥ oo, with a fixed number of observations K in each replicate. The devel- 
opment below is simpler if we work with quantities that remain of order one as n — > oo, and we shall do so 
wherever possible. 

If the true likelihood is unavailable or difficult to work with, 9 is often estimated by the maximum 
composite likelihood estimator 9 C . Let #o denote the true value of the parameter. As each term on the 
right-hand side of equation ^ is a valid loglikelihood, the composite score function V£ t ° t (9;y) is a linear 
combination of unbiased estimating functions and so has mean zero. Under appropriate regularity conditions, 
therefore, the maximum composite likelihood estimator 9 C converges in distribution as follows, 

V^{H(9 ) J{9 y 1 H (9 )} 1/2 (9c -9 a )^N (0, Id p ) , n -> oo, (3) 

where M 1 / 2 denotes a matrix square root, i.e., {M 1 / 2 } 1 ' M 1 / 2 = M, ld p denotes the p x p identity matrix, 
H(9q) = —E[W 2 £ c (9q;Y)} and J(9q) = Vax[V£ c (0o',Y)], where the expectations are with respect to the full 
density. Both H(9q) and J(9q) are positive definite in a regular model, and both are of order one as n — > oo. 
Essentially the usual re gularity conditi ons for the asymptotic normality of the maximum likelihood es- 



timator as n —¥ oo apply 



Davison . 



20031 Sec. 4.4.2], but the parameter 9 must be identifiable from the 



densities appearing in The limiting distribution in equa t ion j3 ) also stems from the behavior of the 



maximum likelihood estimator under mis-specification 



Kent 



1982]. The maximum composite likelihood 



3 



estimator may thus be viewed as resulting from a mis-specified, or, more accurately, under-specified, statis- 
tical model, leading to consistent estimation but with a "sandwich" variance estimator of the type arising in 
longitudinal data analysis and many other domains. 



1.3 Bayesian inference with a composite likelihood 



Bayesian inference based o n composite likelihoods has be en little explored. Motivated by the spatial extremes 



problem mentioned above, 



Smith and Stephenson 



.1 



20091 ] use a pairwise likelihood and Markov chain Monte 



Carlo simulation to fit a max-stable model for rainfall at five sites in South- West England. They obtain 
a posterior by replacing the unavailable full likelihood with the pairwise likelihood, but although they 
ment ion that thi s subs titution may lead to overly precise inferences, they do not describe how to correct 
this. 



Pauli et al 



201 lj independently suggest the adjustment to the composite likelihood called by us the 
magnitude adjustment in Section 12.11 establish the asymptotic normality of the corresponding composite 
posterior and apply the method to a five-dimensional data set on air pollution. 

Related to Bayesian inference with comp osite likelihoods is work in Bayesian methods when one lacks 
or wishes to avoid using the true likelihood. 



Monahan and Boos 



1992] explore the validity of a posterior 



when the likelihood is not the conditional density of the data given the parameter, and propose both an 
alternative definition based on the coverage of posterior sets and a test that can be used to invalidate a 



particular replacement likelihood. 



Lazar 



)3] 



2003] applies this test when an empirical likelihood is used in 



place of a parametric one. Other work on Bayesian methods wi th conditional or pseudo likelihoods (e.g., 



Efron 



199^ . 



Chang and Mukeriee 



)6] 



2006] and 



Ventura et al 



2009f ) is typically motivated by a desire to avoid 



specifying a full likelihood when there are nuisance parameters, and thus focuses on Bayesian implementation 
using a pseud o-likelihood, which i s often a marginal, conditional or profile likelihood for the parameters of 



interest. Like iMonahan and Boost our ultimate aim is the practical one of using a composite likelihood to 
provide valid inferences; for example, the resulting posterior confidence sets should be correctly calibrated. 
Provided that J L' ot (0; y)Tr(9)dO is finite, we use ([1} to define a composite posterior density as 



Tc(f I y) 



L^(e-y)n{6) 



(4) 



where 7r(-) is the prior density. The first question arising is under what circumstances / L' ot (#; y)Ti{6)A9 < oo, 
so that is well-defined. In Bayesian analysis, integrability questions usually arise when discussing improper 
priors; but here we suppose that tt(-) is proper. Then a sufficient condition for Q to be proper is that for 



4 



each i there exists a finite hi such that sup e / (y € &/f, 9) < bi, since in that case 

f l}°\9-y)K{9)&9 = f nil/^ e ^;fl)""7r(fl)dfl < JJi?" 1 < °°- ( 5 ) 

The boundedness of /(y £ =0^; 0) holds in many cases, and in cases of doubt it can be imposed by recalling that 
in practice continuous observations are always rounded to some extent. The correct likelihood is therefore a 
product of probabilities obtained as differences of cumulative distribution functions, for which b% = \. The 
rounding is often ignored so that simpler density function approximations to the correct likelihood may be 
used, bu t if thes e approximations lead to difficulties, then we may choose to work with the correct likelihood; 



see, e.g., 



Copas 



1972]. As this rounding argument applies to any probability elements in (flj), and, with minor 
changes, also applies to the modified composite likelihoods used below, in practice we may always arrange 
that / L* ot (6»; y)n(0)d9 < oo and thus that © is proper. 

Since the composite likelihood is not the likelihood believed to have generated the data, the naive imple- 
mentation of a composite posterior may give misleading inferences, as we now illustrate. 

Example 1. Let be a stationary Gaussian process with unknown mean \x £ K and with covari- 

ance function j(h) = rexp(— h/uj), where the sill t > is unknown but the scale u) > is known. Let 
{y(x±), . . . , y(xic)} be one realisation of this process at locations x±, . . . , xk € K. Now consider a prior 
density of the form tt(9) = 7r(/x)7r(r), where 7r(/x) ~ N(a,b) and 7r(r) ~ IG(c, d), i.e., an inverse Gamma 
distribution with shape c and scale d. 

Here the prior densities are conjugate for ir(6 | y), so the full conditional distributions needed for Gibbs 
sampling are easily found to be 

n(fi \---)~N(jl,a 2 ), tt(t I • • • ) ~ IG jc + y , d + i(y - Ml) T S- 1 (y - M)} , 

where a 2 = (b^ 1 +r~ 1 l T £~ 1 l) , p, = a 2 (ab~ l +r _1 l T S _1 2/) and E is the correlation matrix derived 
from 7(-). 

The full conditional pairwise distributions are also readily available, and are 



7T p (/x | • ••) ~ N (Apj5-p) , tt p {t I ■ • • ) ~ IG < c H ,d + -(y p - fil) S p (y p - fil) \ , 



2 



where a 2 = (b 1 + r 1 l T T, p 1 l) , j2 p = a 2 (ab 1 + t 1 l T E p 1 y p ), Y* p is a block diagonal matrix with 



5 



Q) O 




CD LO 
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Figure 1: Marginal full and pairwise posterior densities for the mean /j, (left) and sill r (right), derived from 
n = 50 realisations of a Gaussian process observed at K = 20 locations having an exponential covariance 
function with fj, = 0, r = 1 and u) = 3. 



blocks 



1 



T l j(Xi~Xj) 



T 1 j(x i - Xj) 



1 <i < j <K, 



and y p = (yi , y 2 , 2/1 , 2/3 , ■ • ■ , 2/i, 2/2, 2/3, ■ • ■ , 2/2, 2/x, ■ • ■ -.Vk-uVkY ■ 



□ 



Example Q] shows that, as might be expected, the full conditional densities derived from the pairwise 
likelihood differ from those derived from the full likelihood. Since l T Al is the sum of all entries of the 
matrix A and S p is block diagonal, it is not difficult to show that 



r ! Tg(g-1) 



I 71 !]" 1 ! < K, 



so, in particular, 



g < (l+T)(T + bK) 

a 2 ~ (1 + r)r + ^(A" - 1) 



and when r is fixed, cr^/cr 2 \, as K — > oo. 

To illustrate this discussion, Figure [1] shows posterior marginal density estimates for fj, and r based on 
the composite and full likelihoods, found using a Gibbs sampler. These densities were obtained by taking 
the same setting as in Example Q] with /i = 0, r = 1 and w = 3, the last taken as constant in the sampling 
algorithm, with K = 20, and with the locations x\, . . . , taken uniformly at random in [0, 20]. There are 
n = 50 independent replicates of these data. A Gaussian prior with mean and variance 100 was placed on 
ji, and independently an inverse gamma prior with shape 1/10 and scale 1 was placed on r. The marginal 



6 



composite posterior densities are much too concentrated, because the pairwise likelihood treats the pairs of 
observations as though they were mutually independent and thus uses each observation repeatedly — see the 
definition of y p in Example [T] 

The aim of this paper is to propose a framework for approximate Bayesian inference from composite 
likelihoods when the full likelihood is not available. Our aim is to obtain composite posterior distributions 
that give credible intervals with reasonable coverage. Section [5] introduces two adjustments to the composite 
likelihood that are intended to retrieve some of the desirable properties given by the usual likelihood. Sec- 
tion [3] shows how these adjustments can be incorporated into Markov chain Monte Carlo samplers, and their 
performance in simulation studies is discussed in Section 2] Section [5] gives a case study on the modelling of 
extreme rainfall around Zurich. The paper closes with a brief discussion and two technical appendices. 

2 Adjustment of the composite likelihood 

We ultimately wish to perform a Bayesian analysis, in which setting there is no "true" parameter value 8q. 
However, we use asymptotic relationships developed under the frequentist paradigm to adjust the likelihood 
to obtain appropriate inference for the composite posterior, and thus speak of 9q throughout this section. 

The theory of unbiased estimating functions applied to the score functions of composite likelihood implies 
that under suitable regularity conditions, the modes of a composite posterior and of the full posterior density 
will approach one another as the sample size increases; see Figure [T] However, the figure also shows that 
the composite posterior density can differ significantly in spread from the true one, because the composite 
likelihood treats the events i £ 1} as though they were mutually independent. Below we seek to modify 
the composite likelihood in order to mitigate this. 

Suppose that the parameter 9 = ((j) T , i/j t ) t has true value 9o = (4>q , ipo) T an d that ip contains q elements. 
Let 9 be the restricted maximum likelihood estimator, obtained by maximizing the full log likelihood l(9\ Y) 
over 9 with ip held fixed at ipo and let 9 C be the restricted maximum composite likelihood estimator, which 
maximizes (JTJ with ip held fixed at tpQ. Then as n — >• oo, 




(6) 



7 



whereas for the composite likelihood, 



A c (Vo) = 2{4(0 C ; Y) - l c (6 c ; Y)} A x * x * 



(7) 



where X\, . . . , X q are independent \i random variables, Ai, . . . , X q are the eigenvalues of the q x q matrix 
jH(do)^ 1 J(6q)H(6 q)~ 1 }j i, \^H(6 q)'' 1 }^,]~ 1 , and A$ denotes the sub-matrix of a matrix A corresponding to 



the elemen ts of ~0 



ratio tests 



Kent 



1982] . These relationships have 



Rotnitzkv and Jewell , 



1990; 



Chandler and Bate 



previo usly been exploited to provide likelihood 



2007] suitable for misspecified models. Here we 



aim to recover convergence in distribution to the usual \ 2 distribution through two different modifications 
of the composite likelihood: a magnitude adjustment and a curvature adjustment. The reasons for such 
modifications is to make the composite likelihood ratio, which appears in the Metropolis-Hastings algorithm 
but is hidden in the Gibbs sampler, behave in distribution as it would if a full likelihood were available. In 
the remainder of this section we consider only the case where ip has dimension zero, but in Section [3.2.21 we 
will show how partitioning 9 can yield better coverage. 



2.1 Magnitude adjustment 



Rotnitzkv and Jewell 



JO] 



1990], who, 



The magnitude adjustment to the composite log likelihood is inspired by 
in the context of hypothesis testing in longitudinal studies, estimate Ai, . . . , X q from estimates of H{9q) and 
J(Oo), and use them to calculate the appropriate rejection region for the \ 2 test based on (JT)). 
We define the magnitude adjustment by 



£m, S n(0;y) = k£ t c ot (9;y), 9 € 9, 



(8) 



Pauli et al 



where k is a positive constant; §E§ was also suggested by 
n — > oo we have 

A-magn ( V'O ) — 2{£ ma g n ($ c , Y^j ^magn(^ci 

and 



201l|. With this modification and as 
Y)}^kJ2\ t X t (9) 



E[A magn (V> )] — > kJ2 Ai, Var[A magn (V>o)] — > 2fc 2 £ \j. 

i=i i=i 

Setting k = q/J2i=i ^* therefore ensures that E[A magn (^>o)] converges to E[x^] = q, but the higher moments 

of © will not match those of \\ unless all the A^'s are equal or q = 1. For our purposes, we consider the 



8 



case where <p has dimension zero, i.e., k = pj 52f=i ^* where Ai, . . . , A p are the eigenvalues of H(9q) 1 J(0q). 



Varin 



2008] proposes a Satterthwaite adjustment to match the first two moments of A magn (V>o) and \ 2 , 



though their higher moments would still differ. 
2.2 Curvature adjustment 

Another strategy is to modify the curvature of the composite likelihood around its global maximum 9 C by 
considering the adjustment given by 



4urv(0; y) = 4 ot (<?*; v), 0* = o c + c(9 - e c ), 



(io) 



for some constant p x p matrix C. Clearly 9 C is also a global maximum for £ C urvj and 



W CU rv(0;j/) = C 1 ver(6;y)\ e=e *, V 2 4urv(0;y) = G 1 V 2 ir(9;y)\ e=e *C. 



Under mild conditions, Taylor expansion of the usual log-likelihood and the asymptotic normality of the 
maximu m likelihood es timator 9 yield convergence of the likelihood ratio statistic in distribution to a y 2 
variable 



Davison 



2003, Sec. 4.5]. More precisely, the facts that 



A(6» ) -A n{0- 6» ) T S(6»- 9 ), n^oo, 



for some q x q covariance matrix £ depending only on E[V 2 ^(#o; Y)] and 



V^E 1/2 (6i - 9 ) -A JV(0,Id p ), n — > oo, 



ensure that A(#o) converges in distribution to a Xp variable. This occurs because — n~ 1 W 2 £(9; y) converges al- 
most surely to the rescaled inverse of the asymptotic covariance matrix of the maximum likelihood estimator, 
the Fisher information in a single Y . 

This suggests that we should try to ensure that — n V £ CUTV (6 C ; y) converges almost surely to the inverse 
of the asymptotic covariance matrix of the maximum composite likelihood estimator, i.e., H(9q)J(Qq) • H(6q), 
by taking any semi-definite negative matrix C such that 



C T H(9 )C = H(9o)J(9o)- 1 H(9 Q ). 



(11) 



9 



One possible choi ce, C = M^Ma, where M jM A = H(9 )J{9 Q )- 1 H(9 a ) and M T M = H(9 ), corresponds 



to a suggestion of 



Chandler and Bate 



20071 ] for hypothesis testing for clustered data using the independence 



log-likelihood. However, the matrix square roots M and Ma are not unique, and although the choice is 
immaterial for composite likelihoods that are quadratic in the neighborhood of 9 C , it might be necessary to 
ensure that the mapping (|10[) preserves any directions of asymmetry. For this reason we use singular value 
decompositions for M and Ma for the curvature adjustments in this paper. 



2.3 Properties of the adjustments 

Although both adjustments rely on the idea of recovering the usual convergence to a \ 2 variable, they express 
different aspects of this. The magnitude adjustment © is an "overall" adjustment, intended to scale the 
composite likelihood down to the appropriate magnitude; in Figure [T] it amounts to raising the narrower 
curve to a power and thus giving a nonlinear transformation of the vertical axis. Therefore all (local) 
extrema are left unchanged, because V£ mag n(#;2/) = implies that V t t ° t (9\y) = 0, and the composite and 
full posterior modes should be approximately the same, because the composite score function has mean zero. 
The curvature adjustment (|10[) . on the other hand, stretches the horizontal axis linearly so that the curvature 
of £ C urv{9', y) at 9 C matches that of the large-sample log-density of 9 C ; thus this changes the locations of any 
local maxima other than the global maximum at 9 C . Therefore the magnitude adjustment might be more 
appropriate if the full posterior distribution is multi-modal. 

However only the curvature adjustment ensures that the convergence to a x 2 distribution is met; the 
magnitude adjustment only gets the first moment correct. This may have a strong impact on the shape of 
the composite likelihood around 9 C , and therefore on the composite posterior density. 

2.3.1 Asymptotic posterior distributions 

We now derive the asymptotic properties of the composite posterior distributions, both adjusted and unad- 
justed. Provided that the unadjusted composite posterior is a valid distribution, it can be shown under the 
usual regularity conditions that when n is large enough (Appendix [X| , 

7r c (9\y)^N{9 ,n- 1 H(9 )- 1 }. (12) 

Here and below we abuse notation; (|12p means that 9 has the stated distribution, conditional on y, not 
that the posterior density has a distribution. Unlike in the usual case, the unadjusted composite posterior 



10 



distribution does not converge to the asymptotic distribution of the composite likelihood estimator, given 
by ©. 

In their investigation of the asymptotic distribution of the the magnitude-adjusted posterior, 



Pauli et al 



2011 



page 8] state that the post erior has approxim ately the correct variance "by the % 2 approximation 



for the null distribution." Further 



Pauli et al 



2011 



pages 8 & 9] state that the approximation is asymp- 
totically correct when p = 1, and argue that the approximation represents an improvement over the naive 
composite posterior when p > 1. To expand on this, as the scaling constant estimate k = p/Y^i=i ^» use d 
for the magnitude adjustment converges almost surely to p/tr{H (9q) _1 J (9q)} as n oo, we conclude that 
(Appendix 0, 

^ma g „(0 | y) ~ n{6 ,{np)-hv{H{e )- 1 J{6 )}H{e )- 1 } . (13) 

Thus unless 9q is scalar, i.e., unless p = 1, 7r magn will differ from the asymptotic distribution given by (|3|). 
Compared to (|12p. the asymptotic variance is inflated, because tr{_ff (^o) -1 J (do)} > P', see Appendix [B] 
Since the curvature adjustment obtains the correct curvature, it is straightforward to see that 



1 y) ~ N{e ,n- 1 H(e )- 1 j(9o)H(e r 1 } 



(14) 



which is exactly the asymptotic distribution of the maximum composite likelihood estimator. 
2.3.2 Comparison of the Adjusted Likelihood to the Full Likelihood 

The magnitude or curvature adjustment will ensure only that the distribution of the corresponding adjusted 
composite likelihood ratio, A a dj(#o) = 2{£ a dj(9 c ;y) — ^adj(#o; y)}, will approximate the Xp distribution of the 
true likelihood ratio, A(6q). However, since the composite likelihood should contain some of the information 
in the full likelihood, one would hope that A a dj(#o) 1=3 A(# ), i-e., that the values of these ratios should be 
related. Figure [5] compares values of A curv (0o) and A(#o) f° r 200 datasets simulated as described in ill. 31 
Their correlation is f = 0.64 when the number of replicate Gaussian processes is n = 50, and f — 0.79 when 
n = 500: reasonable correlations, but not overwhelming. 

Our aim in adjusting the likelihood is not to approximate the true likelihood — and in turn, approximate 
the full posterior — but rather to obtain appropriate inference from a composite posterior. If we did wish to 
approximate t(Q\) at 9\ £ O, then it can be shown using the curvature-adjusted likelihood that 2{£ CUIV (6 C ) — 
4urv(#i)} A X T X, where X ~ N({H(6 ) J- 1 (9 Q )H(6 )} 1 / 2 (9i - 9 Q ), Id p ), whereas 2{l(B) - l(6 x )} 4 Y T Y, 
where Y ~ N{I(9q) 1 ^ 2 (9i — do), Id p } and I(9q) is the Fisher information matrix based on the full likelihood. 
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5 10 15 20 0.00 0.30 5 10 15 20 0.00 0.25 

Figure 2: Comparison of 200 likelihood ratios for the Gaussian process simulation with n replicates: 
A CU rv(#o) for the curvature-adjusted composite likelihood (y-axis) versus A(9 ) (x-axis). Left: n = 50. 
Right: n = 200. 

Obviously, the approximation will degrade as {0\ — 9o) grows. Since the true likelihood and information about 
1(9$) would not be available in a realistic application, it seems unclear how to improve the approximation to 
the true likelihood away from 9q. Simply put, by not having the full likelihood available, we lose information. 



3 Markov chain Monte Carlo samplers 

This section describes implementations of Markov chain Monte Carlo basing Bayesian inference on com- 
posite likelihoods. One must take care to show that MCMC algorithms will converge to the correct target 
distributions, as composite likelihoods, adjusted or not, are not valid likelihoods. We describe the adjusted 
Metropolis-Hastings algorithm and the Gibbs sampler in turn. 

3.1 Adjusted Metropolis— Hastings algorithm 

In Section 2 we suggested two adjustments intended to provide approximations to the full likelihood ratios. 
We now discuss an adjusted Metropolis-Hastings algorithm, given in Algorithm [IJ and verify that it has the 
desired stationary distribution. 

Implementation with one of the adjusted likelihoods, L roa g n (0; y) or L CUTV (9;y), requires only a pre- 
liminary maximisation of the composite likelihood to estimate the matrices H(9q) and J{9q) for the ad- 
ustment. The argument that establishes detailed balance for the original Metropolis-Hastings algorithm 



Robert and Casella . 



20051 Theorem 7.2] applies to Algorithm [TJ and it can be shown that apart from nor- 
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Algorithm 1: Adjusted Metropolis-Hastings algorithm. 

Input : 9 C , H{d c ), J(0c), #i G 0, a proposal distribution q(- \ 9) and an adjusted composite 

likelihood L adj (-;y) 
Output: A realisation of length N + 1 from a Markov chain 

for t <- 1 to N do 

e(P) ~ q (. I 

a a-(8® 9 (p) ) <- minll W0 (p) ;yMe (i >(e (t) ) . 

a a d}{0 ,o j ^ mm L ^ (e(t) . y)7r( . e(t))(j(e(p)|e(t)) j, 

17 ~ 17(0, 1); 

if a ad j(0 (t) ,0 (p) ) < C/ then 
| e (t+i)^ e {p). 

else 

| g(t+i) 0(t). 

end 
end 

return {0 (t) }t=i,...,jv+i; 



malizing constants, the stationary distribution of the Markov chain is given by 

L™(6;y) k Tr(9), k = p/Y^X i , (15) 

for the magnitude adjustment and 

exp{4urv(0;y)}7r(0) (16) 

for the curvature adjustment. The stationary distributions (fT5|) and (fl6|) should provide better coverage 
than if an unadjusted composite likelihood was used. 

3.2 Gibbs sampling 

When the unknown parameter 9 has low dimension, Algorithm Q] should provide approximate inference for 9 
without too much Monte Carlo effort. For models in which 9 is of high dimension, however, the probability of 
acceptance may be too low for Algorithm Q] to be viable, and then the parameter vector is often partitioned 
and Gibbs sampler employed. Let us write 9 = (Of , . . . , 9q) t , where 6j <E R Pj ' and Ylj=i Pj = an d suppose 
that we wish to draw from 

tt(9 \y)ocL(6;y)Tr(e). (17) 
A typical implementation of a Gibbs sampler will successively draw from 

tt(^ \e-j,y) ocLto \6- h y)'K(6 j ), j = l,...,G, (18) 
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where 9-j is the parameter vector 9 with the elements of 9j removed. In this section we propose two Gibbs 
samplers for use with composite likelihoods. 



3.2.1 Overall Gibbs sampler 

Since the true likelihood is unobtainable, we use the Gibbs sampler with an adjusted composite likelihood. 
We could replace L(9] y) in (|17[) with L a ^(9] y), where L a( jj is either the magnitude- or the curvature-adjusted 
composite likelihood. To perform Gibbs sampling, 9 C , H(9 C ) and J(9 C ) can be estimated once prior to running 
the algorithm, and L a dj(#;y) can be calculated. Gibbs sampling then proceed s as usual. 



Robert and Casella . 



2005, sec 



As the Gibbs sampler is a special case of the Metropolis-Hastings algorithm 
10.2.2] and it was shown in Section lOl that the latter could accommodate an adjusted composite likelihood, 
this overall Gibbs sampler algorithm converges to the stationary distributions given by (fT5|) or (|TB| . 

3.2.2 Adaptive Gibbs sampler 

In real problems, the dimensions of 9, and hence of 9 C , can be quite large. By finding 9 C , H(9 C ) and J(9 C ) 
only once before implementing the algorithm, the overall Gibbs sampler loses the 'spirit' of Gibbs sampling, 
which is to sample the lower-dimensional 9j given the current value of 9-j. 

An alternative to adjusting the likelihood in (IT71) is to replace the likelihood in (fTS)) by an adjusted 
composite likelihood. That is, the likelihood for 9j can be adjusted based on the current values of 9-j. 
Since this adjustment requires knowledge of the maximum composite likelihood estimates, the value of 
#j.c | 9-j = 6_j must be found at each step. This approach has the advantage that the adjusted composite 
likelihood approximation using the current value of 9-j should be more accurate, as the approximation is 
made in a lower-dimensional parameter space. In particular if 9j is scalar, then the magnitude adjustment of 
the composite likelihood ratio statistic is exact; see ©. This adaptive Gibbs sampler is given in Algorithm^ 
It can be shown t hat A lgorithm [2] corresponds to a well-defined posterior by considering the completion 
section 10.1.21: 



Robert and Casella . 



2005 



G 

*0,O\v) = H*0j\8,vM8\v), 
j'=i 



where ir(9 | y) represents the target density. Note that 

G 

*{0\V) = J Yi^0j I 0,y)ir{9 \ y)d9 
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Algorithm 2: Adaptive adjusted Gibbs sampler. 
Input : 0« e e 

Output: A realisation of length N + 1 from a Markov chain 

for < <- 1 to AT do 
for j <— 1 to G do 

Get the restricted maximum composite likelihood estimate 0j jC with 9-j held fixed at 9^; 

Get Hj,j(6j) = V 2 £ c (9j, c \ 6_j,y) and Jj,j(9j), the sample covariance matrix of 

~S/£c{6j,c | i = 1, • • • and define the adjusted composite log-likelihood <? a dj(#j | y) 

from either © or ([ID]): 

Draw 8j t+1 ^ from L a ^ i (9j-,y,9^-)'K(9j \ 9-j) (using Metropolis-Hastings updates if necessary); 
end 
end 

return {0 (t) }t=i,...,w+i; 



as required for a completion, provided that Tr(6j \ 0, y) is a valid density. Define 

\0,y) = 5argmaxL c (e 3 |e_ j: y)(fi), 



that is, a Dirac measure on the value of 9j that maximizes the composite likelihood given the current values 
of j . If the maximum composite likelihood estimates 9j could be found analytically, then Algorithm [5] 
would simply be a Gibbs sampler on the completion. Since 6a must be obtained numerically, convergence of 
the Markov chains must be carefully checked by examining the output. 

In the context of the adaptive Gibbs sampler, both the magnitude and curvature adjustments must be 
understood as adjusting the conditional likelihood L c (9j \ 9-j,y). That is, k in equation ([5]) now becomes 
Pj/Y2iLi where Xi are the eigenvalues of the matrix defined by H(9j) and J(9j). Similarly, the matrix C 
in dHJ) is defined by H{9j) and J{9 j ). 

It is instructive to tie each of the Gibbs samplers to the asymptotic distribution of the posterior. Let 
7r [6 | y) denote the composite posterior distribution evaluated at 9 £ M. p and further assume that the 
asymptotic posterior distribution corresponds with that of the maximum composite likelihood estimator, 

lo g7 r(0 |y)dc 9o) T H(9 )J- 1 (9o)H(9 )(9 - 9 ), (19) 

where 6c means 'asymptotically proportional to'. Gibbs sampling for a given partition 9 = (9j,9-j) T , where 
9 e K Pj and 9-j G W~ Pj , would involve drawing from n(9j | 9-j,y), the conditional posterior distribution 
of 9j given some fixed value for 9-j . 
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In the overall Gibbs sampler, one begins by approximating (1191) with 



lo g7 r adj (0 \y)<k - 9 C ) T H^{9 C ){9 ~ 9 C ). 



(20) 



where H ad ^(9 c )^ 1 is the covariance matrix in ([TU1) or (TT^|) for the magnitude- and curvature-adjusted poste- 
riors respectively. 

Let 9 C = (9 c j,9 C} _j) T and partition 



H ad H 



H 



adj , 



H 



adj 



H 



adj , 
-j,3 1 



H 



adj i 

-3,-i y 







Since (|20l) implies that 7r ac ij { (#j , 0_j) T } is approximately a Gaussian density with mean (0 c ,j> ®c,-j) T and 
covariance matrix £ = if ■'(^ c ) _ , we see that 7r a dj(0j | 9-j,y) is approximately Gaussian, with mean 



(21) 



and covariance matrix 



^3,3 ~ — HjJ( 



(22) 



The adaptive Gibbs sampler makes its approximation later in the algorithm. Starting from (|19[) , let 9-j 
be given and consider log7r(0j | 6-j,y). By partitioning 9 and H(9o)J~ 1 (9o)H(9o), it is straightforward to 
show that the asymptotic conditional posterior is 



log7r(0j | 9-j, y) 6c (6j - H^-jf^^yL^ - fij\-j) 



(23) 



where 



and 



= 6°,3 - {H(9 )J- 1 (9 )H(9o)}^ {H(9o)J- 1 (9 )H(9 )} i _ j (9^ - O ,-;), 



A 3\-3 



{H{9 )J- 1 {9 )H{9 )}-] 



(24) 



(25) 



analogous to (f2"Tj) and (j2"2")l above. The adaptive Gibbs sampler makes its approximation to the conditional 
distribution, estimating the conditional mean by finding c> j\_j, the value which maximizes the, lower- 



dimensional, conditional composite log-likelihood £, 



rWj,c I 



,y), and then adjusting this lower-dimensional 
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likelihood to obtain an estimate for {H(9q)J^ 1 (9o)H(9q)^ . 

The advantage of the overall Gibbs sampler is computational and in its simplicity; the adaptive Gibbs 
sampler's need to estimate 0j at every step slows it tremendously. The potential gain from the latter is that 
the approximation made by employing a composite likelihood is made only for the subvector 9j and is done 
with knowledge of the current values of the other parameters. In the next section we explore by simulation 
whether the adaptive Gibbs sampler improves overall estimation. 



4 Simulation study 



In this sec tion, we use simulation to a ssess the performance of the magnitude and the curvature adjustments. 



Following 



Monahan and Boos 



1992] . we assess whether our adjustments yield posteriors that are valid by 



coverage, i.e., whether Pr[0 G GI a {Y)] = a, under some probability measure for 9 dchned on and some 
credible intervals CI Q with level < a < 1. 

We first apply the proposed adjustments to the stationary isotropic Gaussian process of Section [1.31 and 
compare the results obtained using the adjusted composite likelihood to those using both the full likelihood 
and the naive composite likelihood. We then focus on spatial extremes by considering a Bayesian hierarchical 
model involving max-stable processes. 



4.1 Gaussian processes 

We again consider a one-dimensional stationary Gaussian process with mean /x £ R and an exponential 
covariance function j(h) = rexp(— h/u), r > 0, lj > 0. We examine two different forms of dependence, 
allowing ui to equal 3 and 1.5, which respectively yield effective ranges for the covariance of roughly 9 and 
4.5. The priors on /j, r are those reported in Section Q] while an inverse Gamma density with shape 1/10 
and scale 1 is assumed on uj. The stochastic process is replicated n — 50 times in each simulation and is 
observed at K — 20 locations uniformly generated in the interval [0,20]. The simulation was repeated 500 
times to assess coverage, with fi = and r = 1 in each case. 

Figure [3] compares the posterior densities obtained from the full likelihood, the unadjusted pairwise poste- 
rior, and the adjusted composite posterior distributions using the magnitude and the curvature adjustments 
from a single simulation. There is a large improvement due to the adjustment. Owing to the asymptotic 
unbiasedness of the maximum composite likelihood estimator, the modes of the marginal composite posterior 
distributions are close to those obtained from the full likelihood. The use of the adaptive Gibbs sampler 
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Figure 3: Comparison between the marginal full posterior (black), the marginal pairwise posterior (red) 
and the marginal adjusted pairwise posterior densities based on the magnitude (green) and curvature (blue) 
adjustments. The posterior distributions are derived from n = 50 realisations of a Gaussian process having 
an exponential covariance function with fi = 0, r = 1 and uj — 3 and observed at K = 20 locations. Top 
row: Metropolis-Hastings algorithm. Bottom row: Adaptive adjusted Gibbs sampler. 

for the magnitude adjustment seems to improve the approximation to the full posterior, particularly for the 
range parameter uj; recall from Section 13.21 that this is not an overall magnitude adjustment. The adaptive 
sampler used here has three blocks each comprising a single parameter. 

Table [T] summarizes the empirical coverages based on 500 replicate data sets. Overall, the adjustments 
give reasonable credible intervals, whereas the naive composite posterior has poor coverage. The Metropolis- 
Hastings algorithm and overall Gibbs sampler have the same stationary distribution and give the same cover- 
ages for each adjustment. The curvature adjustment performs better overall than the magnitude adjustment, 
particularly for the mean and range parameters u, and uj. The improvement in coverage due to using the 
adaptive Gibbs sampler appears greater for the magnitude adjustment than for the curvature adjustment, 
partly because there is more room for improvement, and because the latter was already adjusting each 
element of 9 differently. The curvature and adaptive magnitude adjustments yield the best coverages. 

Figure^ which complements Table[T]by showing how the empirical coverages depend on the credible level 
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Table 1: Empirical coverages (%) for nominal 95% credible intervals based on 500 Gaussian process simula- 
tions. "Full" denotes coverage with the full posterior, "Magnitude" corresponds to the magnitude adjusted 
posterior, "Curvature" to the curvature adjusted posterior, and "Unadjusted" to the naive composite pos- 
terior. 
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for the overall and adaptive Gibbs samplers, corroborates the conclusions drawn from Table [T] Compared 
to the unadjusted composite posterior, the proposed adjustments clearly improve the coverages and seem 
to yield essentially the same coverages as the full posterior, though the latter provides shorter intervals, if 
it is available. The adaptive Gibbs sampler for the magnitude adjustment performs better than its overall 
counterpart, indicating that the latter might not be flexible enough to provide the correct coverages for each 
element of the parameter vector. The curvature adjustment again seems to be improved less by the adaptive 
version of the Gibbs sampler. 

Figure |4] shows that the proposed adjustments have good coverage properties, but it is also interesting 
to check to what extent the composite posterior distributions share common features with the full posterior. 
Figure [5] shows boxplots of the first four centered moments of the estimated posterior distributions. As 
one would expect from the fact that the composite likelihoods give unbiased estimating equations, the first 
moments of the composite posterior distributions, including the unadjusted one, match those of the full 
posterior. The variance of the unadjusted pairwise posterior distribution is much too small, but those of the 
adjusted posterior distributions are closer to that of the full posterior. The magnitude adjustment combined 
with the overall Gibbs sampler has a smaller variance for the mean fi and a larger one for the range uj; this 
clarifies why Table [T] shows that this particular adjustment tends to undercover fi and overcover uj. Except 
for fi, none of the adjustments gives the correct skewness and kurtosis, though the magnitude adjustment 
is slightly better. Nevertheless, both adjustments can capture the first two moments well, and despite the 
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Credible level a Credible level a Credible level a 

Figure 4: Variation of the empirical coverages with the credible level a, based on 500 replicates of the 
Gaussian process simulation with fj, = 0, r = 1 and oj = 3, for the fuii, the non adjusted pairwise and the 
magnitude/curvature adjusted posteriors. Top row: Overall Gibbs sampler. Bottom row: Adaptive Gibbs 
sampler. 

degradation of the approximation with distance from 9q 7 yield coverage rates which are very comparable to 
those obtained using the full likelihood. 

Finally, we investigate the difference between the magnitude- and curvature-adjusted posteriors and the 
effect of the dimension of the blocks used in the adaptive Gibbs sampler. As noted in Section 12.11 the 
magnitude adjustment will recover the y 2 null distribution only if the dimension of 9j is one. In addition to 
running the adaptive Gibbs sampler with /x, r, and uj each serving as its own block, we also ran a two-block 
version of the adaptive Gibbs sampler with 6\ = /x and 02 = (t,uj) t . For the individual block version of the 
adaptive Gibbs sampler, there is virtually no difference in the estimates of the magnitude- and curvature- 
adjusted posteriors, reflecting that both adjustments adequately capture the information contained in the 
composite likelihood. However, for the two-block version of the sampler, the empirical posterior correlations 
of r and uj differ: the curvature- adjusted posterior gives Cov(r, uj) « 0.69, whereas the magnitude-adjusted 
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Figure 5: Boxplots of the sample centered moments of the estimated posterior distribution for each of the 
500 simulations (u. = 0, r = 1, ui = 3) for the full posterior (full), the unadjusted pairwise posterior (pair), 
the magnitude adjusted composite posterior (magn) and the curvature adjusted composite posterior (curv). 
Red boxplots: Overall Gibbs sampler. Green boxplots: Adaptive Gibbs sampler. 



posterior gives Cov(t, uj) s» 0.33. It is difficult to estimate both the sill and range parameters of a Gaussian 
process [?], whose ratio t/uj is important for applications such as interpolation. The 95% credible intervals 
for this ratio have an empirical coverage rate of 96% for the curvature-adjusted posterior, but a coverage 
rate of 100% for the magnitude-adjusted posterior. This suggests that for the two-block Gibbs sampler, 
the magnitude adjustment fails to fully capture the relationship between these two parameters, thus giving 
further evidence that the curvature adjustment is to be preferred, since it seems to provide output that can 
be used more flexibly. 
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4.2 Bayesian hierarchical model for spatial extremes 

Let Y m (x), x € T> 1 m > 1 be independent replications of a stochastic process. Asymptotic theory for extremes 
implies that, provided the limit exists and is non-degenerate, the process 



max a n (x) l {Y m (x) — b n (x)} 

m— l,...n 



de Haan 



19841. Given observations that arise 



converges weakly to a max-stable process Z(x) as n — > +oo 
as block (e.g., annual) maxima, it is therefore natural to approximate their joint distribution using such a 
process. The univariate marginal distributions for such a process will be generalised extreme-value (GEV) 
distributions, which depend on three parameters. 

Although the general methodology we propose could be applied with any max-stable model 



Schlather 



199o| . 



2002 



Kabluchko et al 



1 



Smith 



2009l | . we focus here on the Gaussian extreme value process of 



1990: 



Smith 



Z(x) = maxCfci^a; - s k ) 



k>\ 



(26) 



where {((k, Sk)}k>i are the points of a Poisson process on (0, oo) x T>, with T> C R d , having intensity 
dA(£, s) = £~ 2 dCds, and ip is the zero mean <i-variate normal density with covariance matrix S. As formu- 
lated, Z{x) has unit Frechet mar gins and its bivariate and trivariate marg inal distributions can be used to 



construct a composite likelihood 



Padoan et al 



2010; 



Genton et al 



2011 - 



A simple approa ch to fitting max-stable models is to employ a pairwise likelihood 



Gholamrezaee . 



Padoan et al 



2010: 



2010J. To account for non-stationarity in the marginal distributions, it is convenient to assume 



that the GEV parameters follow response surfaces that depend on location and on covariates such as altitude. 
Often, however, the available covariates do not fully explain the variation of the marginal distribution over 
the study region. One approach to capturing the regional effects is to construct a hierarchical model in which 
the marginal parameters of the extreme value distribution follow a stochastic process, such as a Gaussian 
process, over the study region. 

Our approach is to use a max-stable process model within a hierarchical framework; the max-stable 
model provides a theoretically justified model for the local dependence, i.e., the spatial dependence of the 
extremes, and the hierarchy allows for flexibility in modeling how the regional effects influence the marginal 
behavior. The difficulty is that the full likelihood is unavailable, and so fully Bayesian inference cannot be 
performed. Instead we employ one of the adjusted MCMC samplers suggested in Section [3] 
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Our chosen model has the data-process-prior framework of most hierarchical models: 



A* I P^t^u^ 
loger | f3 a ,T a ,uj a 
£ I /3 4 ,t ? ,w 5 



Smith's max-stable model, 
GP (X^, 7/1 ), 
GP(X ct /3 ct , 7ct ) 
GP (X^, 7e ), 



where /x, er, £ represent the three GEV parameters, GP(m, 7) denotes a Gaussian process with mean m and 
covariance function 7, the 7. 's represent exponential covariance functions with corresponding sill and range 
parameters r. and uj., and the /3. are regression coefficients associated to the design matrices X.. 

The prior level places independent priors on all parameters introduced at the process level. We take 
conjugate normal priors for all regression parameters f3. , conjugate inverse gamma priors for the r. , gamma 
priors for the range parameters uj. and a Wishart prior for the covariance matrix £ appearing in the Smith 
model. In all cases, the prior variance is set to be large so that the prior densities, though proper, are 
relatively flat. 

We performed a simulation study to evaluate our approach. Gaussian processes were simulated for n(x), 
cr(x), and £(2;), with fi(x) and <j(x) dependent, and with values similar to those found for annual maximum 
rainfall data. Then, 50 max-stable processes with marginals given by er(x), and were simulated 

according to the Smith model. Fifty locations were chosen and the 50 observations at each location were 
used to fit four models: 

Ml the hierarchical model with a conditional indepe ndence assumption i n t he data layer, yield i ng a p roduct 



of K independent GEV densities, analogous to 



Coolev et al 



2007] or 



Sang and Gelfand 



2009 1 



M2 the max-stable process hierarchical model with no adjustment; 

M3 the max-stable process hierarchical model with an adaptive curvature-adjusted Gibbs sampler; and 



M4 the max-stable pro cess model where th e marginals are described by a response surface in the covariates 



proposed by 



Padoan et al 



20K 



The left panels of Figure [5] show boxplots of the differences between the true GEV parameters and all 
the states of the Markov chains for four different stations, with the asymptotic 95% confidence intervals for 
the max-stable response surface model. The right panels of Figure [6] display the coverage rates, for all 50 
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Figure 6: Boxplots of the difference between the true GEV parameters and all the states of the Markov 
chains for four stations (left panel). For each of the stations the boxplots are (from left to right) the 
conditional independence model (Ml, red), the non-adjusted hierarchical model (M2, green), the hierarchical 
model with the curvature adjustment within an adaptive Gibbs sampler (M3, blue), and the asymptotic 
95% confidence limit from the max-stable response surface model (M4, grey). The right panel shows the 
proportion of the credible intervals at level a = 95% containing the true GEV parameters. 



stations, of the 95% posterior credible intervals for the three hierarchical models, with the 95% confidence 
intervals for the max-stable response surface model. As expected, the unadjusted max-stable hierarchical 
model produces a posterior that is too concentrated and yields very poor coverages, and the max-stable trend 
surface model is not flexible enough to account for the complicated regional behavior of the GEV parameters, 
as evidenced by the poor point estimates in the box plots and the corresponding poor coverage rates. The 
adjusted max-stable hierarchical model and the conditionally independent hierarchical model produce very 
similar posterior distributions and have similar coverage rates, although the max-stable model does slightly 
less well. 

The advantage of the max-stable hierarchical model over the conditional independence model is that the 
former can account for local dependence; even with only 50 locations in the region, it seems to be able to 
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Figure 7 : Comparison between one realization of the observed Geld and one realization of the different mod- 
els analyzed. From left to right: observed held; conditional independence model; and max-stable hierarchical 
model with adjustment. The same seed was used for each simulation. 

detect the true pattern of local dependence. The 95% credible intervals for the elements an, a 12 and 022 of 
S are (5.39,8.76), (-1.28,0.67) and (5.58,8.37), which include the true values 6, 0, and 6. The fitted max- 
stable model provides a mechanism for producing realistic draws from the spatial process. As Figure [7] shows, 
a draw from the posterior distribution of the conditional independence model would be inappropriate and 
unrealistic for spatial phenomena such as rainfall or temperature, annual maxima of which would produce 
smoother surfaces. 

These results are obtained from a (near) perfect model simulation; that is, the max-stable hierarchical 
model fitted to the data was nearly identical to that from which the data were simulated. Nevertheless, 
this simulation exercise shows that the adjusted max-stable hierarchical model can flexibly model marginal 
behavior that captures regional spatial effects and can capture local dependence through the max-stable 
process model. Despite the approximation due to employing a composite likelihood, the inference obtained 
appropriately captures the uncertainty associated with the estimation. In the next section we show that it 
also seems to perform well on real data. 

5 Application 

We analyze data on maximum daily rainfall amounts for the years 1962-2008 at 51 sites in the Plateau 
region of Switzerland; see Figure [5J The area under study is relatively flat, the altitudes of the sites varying 
from 322 to 910 meters above mean sea level. Data from 16 of the stations were were kept aside for model 
validation and not used for fitting. 

Figure [9] compares the annual maxima over the 16 validation stations, which we term the "groupwise 
maxima" , and the simulated groupwise maxima from the different models. All the max-stable based mod- 
els seem able to model the distributions of the groupwise maxima, though the simple max-stable model 
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Elevation 




Figure 8: Map of the study region. The stations used for inference/validation are depicted by cir- 
cles/triangles. 




Observed values Observed values Observed values Observed values 

Figure 9: QQ-plots to compare the observed maxima of the annual maxima from the validation locations 
and those simulated from various models. From left to right: simple max-stable, conditional independence, 
unadjusted Bayesian hierarchical, adjusted Bayesian hierarchical models. The 95% confidence/ credible en- 
velopes are shown as dashed lines. 

badly overestimates the largest value, perhaps due to inaccurate trend surfaces for the GEV parameters, 
particularly the shape parameter. The conditional independence model shows systematic underestimation, 
confirming that this model is inappropriate. The unadjusted and adjusted Bayesian hierarchical models yield 
similar credible envelopes, which seem principally to reflect the variability of simulated conditional Gaussian 
processes and GEV realizations. 

Figure [10] shows three simulated random fields for each model, taken from a large number of such fields. 
To rank these we took a disk Vzmich of radius 6 km and centered near the Zurich gauging station, and ordered 
the random fields according to their suprema Sz ur i c h — su Pa;ey Zur i C h Y{x). This allows us to summarize the 
intensity of a particular realization of a random field. The three rows of Figure [TU1 correspond to the situation 
where PrfSzurich < Zmt] = ot, where a = 0.05, 0.50, 0.95 respectively and the level z cr it depends on the model 
considered. Roughly speaking, the three rows show patterns for which Szurich is expected to be exceeded 
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Figure 10: Three realizations of random fields over the study region for the conditional independent 
model (Ml), hierarchical models without any adjustment (M2) and with the curvature adjustment (M3) 
and a simple max-stable model with deterministic trend surfaces (M4). The three rows show realizations 
corresponding to different risk scenarios according to the values of 5z U rich expected to be exceeded once 
every 1.05, 2 and 20 years (from top to bottom). 

once every 1.05, 2 and 20 years. 

The conditional independence model leads to unrealistic realizations of extreme rainfall fields, but because 
of the deterministic trend surfaces for the marginal parameters, the simple max-stable model produces fields 
that are too smooth to be realistic. The unadjusted and adjusted hierarchical models seem to produce the 
most plausible realizations. 

Figure [TT] plots return level curves, i.e., graphs of the estimated pth quantile of Szurich an d a similar 
quantity Sdob for the DOB gauging station, against 1/(1 — p), and smaller disks of radius 0.3. For the 
smallest neighborhood, the return level curves are compared to the observations available at the Zurich and 
DOB gauging stations; see Figure [SJ 

As the neighbourhoods of radius 0.3km are very small, the return level curves should be close to the 
empirical curves computed from the data available at the Zurich and DOB gauging stations. This is indeed 
the case for Zurich, where all the models apparently reproduce the distribution of extreme rainfall quite well. 
The results are less convincing for the DOB gauging station where, apart from the adjusted hierarchical 
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Figure 11: Comparison between the return level curves (cm) computed on neighborhoods centered at the 
Zurich (top) and DOB gauging stations (bottom) and having radius 0.3 and 6 km (left and middle panels) 
for the conditional independent model (Ml ), the hierarchical models without any adjustment (M2) and with 
the curvature adjustment (M3) and a simple max-stable model with deterministic trend surfaces (MA). The 
left panels compares the return level curves to the observations available at the gauging stations. The right 
panel is the same as the middle one but shows only the max-stable based models. 



model, all the models seem to overestimate the largest extremes. This situation is similar to that seen 
in Section 14.21 the unadjusted hierarchical model produces a posterior that is too concentrated, while the 
max-stable trend surface model might not be flexible enough. Both models fail to capture the complicated 
spatial behavior of the GEV parameters. 

For the neighbourhoods of radius 6km, the central panel of the figure shows a very strong discrepancy 
between the models, because of their different spatial assumptions. The conditional independence model 
yields unrealistically high return levels, of around 2m for 10-year values, for example. All the max-stable 
based give approximately the same return levels for return periods shorter than 10 years. For larger return 
periods, the unadjusted hierarchical model gives the largest estimates. The same plots for 20 other gaug- 
ing stations depicted the same patterns, suggesting that the unadjusted hierarchical model systematically 
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overestimates the distribution of the supremum in a given neighborhood. 



6 Conclusion 

In this paper, motivated by a real problem in which Bayesian inference seems natural but a full likelihood 
is unavailable, we investigate the usefulness of composite likelihood within a Bayesian framework. The 
posterior distribution obtained from a naive implementation of a composite likelihood can have very poor 
coverage properties, owing to its inappropriate re-use of the data. 

To bypass this hurdle, we propose two modifications of the composite likelihood to recover the usual 
asymptotic distribution of the likelihood ratio statistic at the true value of the parameters 9q. We show 
how these adjustments can be implemented in Markov chain Monte Carlo algorithms and propose two ways 
of integrating them into the Gibbs sampler. Although the approximation degrades with distance from the 
parameter underlying the data, simulation studies show that the proposed framework has coverage properties 
similar to those obtained using the full posterior. 

The work was motivated by a need to flexibly model the marginal distributions when modeling spatial 
extreme phenomena. We construct a Bayesian hierarchical model whose data layer is driven by a max-stablc 
process while the marginal parameters are modeled as realizations of a stochastic process. A spatial extreme 
simulation study showed that this framework is able to capture complex marginal behavior as well as the 
spatial dependence in the data. An application to extreme rainfall around Zurich shows that the approach can 
capture both local dependence due to individual storms and regional dependence due to similar climatologies, 
thus broadening the scope of max-stable modelling beyond its current limits. 
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A Asymptotic distributions of the posterior distributions 

The derivation of the asymptotic normality of the posterior distribution heavily relies on Taylor expansions. 
Let 6 C denote the maximum composite likelihood estimate, let 6* pr i or denote the mode of the prior distribution 
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7r(0), and let 

hl°\6 c ) = -V^fo; C ), Vior(^prior) = ~Vg log 7r(0 prio r). 

For n large enough we have 



ir c (9 | y) cx cxp |^ ot (y; § c ) - -(9 - 6 C ) T hl ot {6 c ){6 - 9 C ) + lo g 7r(0 pr ior) --{6- 9 prior ) T h pliol (6 plior )(6 - 9 priol ) 



j#, h(9 c , #prior) 



where h(9 c , 6» prio r) = K ot (e c ) + h pliol (0 pliol ) and 9 = h(6 c , 6> prio r) 1 {h t ° t {9 c )9 c + h pi i OI {6 pi i OI )6 pliol } . 

Provided the contribution of the prior distribution n(6) vanishes as n — > oo, the strong law of large 
numbers implies that 

n-^Oprior) = + Vior(gprior) | _^ — E[V 2 £ c (9q]Y)] = H(6 ), 

~ j h(9 c ,9 plioI ) \ jh t ° t (9 c )^ /lprior(^prior) fl \ „ 
V = \ ( \ Cc H t/prior > > #0, 



almost surely, and thus n c (9 \ y) ~ {6* , n _1 i/(0 o ) -1 }- 

The derivation of the asymptotic distribution for the magnitude adjustment uses the same argument, 
with a slight modification. As n — > oo, 

fc^p/trjff^or 1 ^)} 

almost surely. Since fc is estimated prior to running the MCMC algorithm, we can assume that k is a (tuning) 
constant that does not depend on 9. Therefore the analogue of h c (9 c ) when using f ma g n in place of is 

(y,9 c ) — > tv{H(6o) 1 J(9o)}H(9q), n -> oo, 

almost surely, from which we conclude that 7r magn (# \y) ~ N {# , (^p) _1 tr{.ff(#o) _1 '/(#o)}-ffX#o) _1 }- 

We conclude with the derivation of the asymptotic distribution of the curvature adjusted composite 
likelihood. By construction we have 

n- 1 /i curv (^ c ) - -n^V&urvG/; — ► H(9 Q )J(9 )- 1 H(9 ), n -> oo, 
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almost surely from which we get that 7r curv (6> \ y) ^ N {9 , n 1 H(9 ) 1 J(9 )H(9 ) . 



B Asymptotic variance inflation 

In this appendix we argue that in many cases in which the densities appearing in the composite likelihood are 
correct, so that they satisfy the first two Bartlett identities, E[Vlog/(Y € Af,6o)] = and E[V 2 log/(Y e 
Ai;6» )]+Var[Vlog/(Y e A l ;9 Q )} = for all i e I, then ti{H(9 )- 1 J(do)} > P = dim(0 o ). This agrees with 
our empirical experience, which is that in many cases tr{H (9q)^ 1 J (9q)} p. 
We first note that 

tv{H{9 a )- 1 J(6 )} - P = tT{H(9 )- 1 J(9 a ) - ld p } = ti[H(9 )- 1 {J(9 a ) - H(9 a )}} > 0. 

Since H(9 )^ 1 is positive semi-definite, the result follows if J{0q) — H{9q) is positive semi-definite, because 
tr{AB} > when both A and B are positive semi-definite. 
On the one hand we have 



H(9 ) = —E 



v 2 ^io g /(y e Ai-,e ) 



iei 



J2 E [V 2 log f(Y e A; 0o)] = ]T Var I v lo § f( Y e A; O )] 



iei 



iei 



because the variance of the score equals the Fisher information for each individual summand. On the other 
hand we have 



J(0 ) = Var 



iei 



= ^Var[Vlog/(Y e Ai-,0 o )]+ ]T E[V log f(Y e A t ;9 )V log f(Y e A 3 ;9 Q ) q 



Thus 



J(9o)-H(0 o )= E[VIog/(y G A;0o)Vlog/(Ye A^o) T ] = X H^Uj + U 3 Uf), 

i,jel,ij^j i,jel,i<j 

say; clearly these expectations are symmetric. To see that they will often be positive definite, let Ai and 
Aj correspond to the events Y e Ai and Y e A,. I 1 these events are independent, then ~&{Ui~Uj) = 0, 
but if not, suppose that that we may write let Ai = A\ n Ay, Aj = A^ n Ay, for some event Ay such 
that A\ and A^ arc independent conditional on Ay. This arises if, for example, in a Markov chain Y E Ai 
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corresponds to {Y\ = yi,Y 2 = y 2 } 1 Y e Aj corresponds to {Y 2 = y 2 ,Y 3 = y 3 }, and we take A! i = {Y\ = y\}, 
Aij = {Y 2 = j/2} and A!- = {Y 3 = y 3 }. If we write pr(Ai) = pr(A' i \ Aij)pr(Aij), then the corresponding log 
likelihood derivative may be written as Ui = U- + Uij in a natural notation, and 

E(UiUf) = E{(£7? + UijWj + U^f) = null!'?) + Vax{Uij) = E{Cov(C^ ) U] \ A tJ )} + Var(^). 

because the cross terms K(UlUij) = E(UjUij) = 0, as may be seen by conditioning on A^. If U[ and 
Uj are independent conditional on A^, then E(UiUj') = Vav(Uij) is positive semi-definite; this would be 
the case in the Markov chain example mentioned above. If they are not independent, but are sufficiently 
weakly correlated conditional on A^ that the term Var(CTjj) is dominant, then E(UiUj) will also be positive 
semi-definite, and hence so will be J (60) — H{6q). This will be the case in typical applications of composite 
likelihood, as terms that correspond to dependent events Ai, Aj will tend to be positively correlated, because 
they are proximate in space or time, or both. 
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